Social network analysis

class notes

Introduction

Networks are complex structures used to represent relations. In math they are studied as graphs, but they can be used in multiple ways. As a way to represent knowledge, to define a model, as database systems, or to represent information. Many machine learning models are built as graphs: Neural networks, decision trees, and even causal models. In this course, we will focus on networks for social science. Here, we will use graphs to represent data about social relations between entities (people, institutions, countries, etc.).

Elements of a network

In order to work with networks, we first need to go through some basic definitions.

\[ G=(V, E) \] Where \(V\) is the set of nodes, and \(E\) the set of links.

\(V\): {A,B,C}, \(E\): {1,2}

Types of networks

Weighted or unweighted networks

The relation between two nodes can either be dichotomous or quantitative. For example, a link can represent co-authorship – given two authors, they either co-authored a paper or not– and for it we might use an unweighted network. But a link can also represent the number of papers two co-authors wrote together, and in this case we might use a weighted network. A weighted network can be represented as an unweighted one for those relations above a given threshold.

Directed or undirected networks

Connections can have a direction5 This changes a little the definition of walks and paths, as they now need to follow the direction of links. Also the distance between two nodes can be asymmetrical.. For example, in Twitter, B can follow A, who follows C, that also follows A back. While in Facebook, the connections are defined as friendship, where there is no direction.

Bipartite networks

Networks can encode complex relations between different types of entities. For example, if we want to represent people that belongs to different institutions, we can have nodes that represent people {\(A_1,A_2,...,A_5\)}, and other type of nodes that represent institutions {\(I_1,I_2,...,I_4\)}, and here the links represent the relation of belonging.

This type of networks can also be simplified using projection, where we keep only one type of node. In the example above, we can relate institutions to each other if they have people in common, and link people if they belong to the same institution:

Dynamic networks

If we want to use network analysis to study a social process, a static representation can fall short to understand how the social interactions evolve over time. For this, we can also think networks as a dynamic process, where new nodes enter, old nodes disappear, and links change over time.

Network representation and metrics

Lets take a small random network and see how it can be represented and which metrics can be used to describe it:

##    1 2 3 4 5 6 7 8 9 10
## 1  0 0 0 0 1 0 1 1 0  0
## 2  0 0 0 0 0 0 0 0 0  1
## 3  0 0 0 1 0 0 0 0 1  0
## 4  0 0 1 0 1 1 0 0 1  0
## 5  1 0 0 1 0 0 1 1 0  0
## 6  0 0 0 1 0 0 1 0 1  0
## 7  1 0 0 0 1 1 0 0 0  1
## 8  1 0 0 0 1 0 0 0 0  0
## 9  0 0 1 1 0 1 0 0 0  0
## 10 0 1 0 0 0 0 1 0 0  0
##       [,1] [,2]
##  [1,]    3    4
##  [2,]    1    5
##  [3,]    4    5
##  [4,]    4    6
##  [5,]    1    7
##  [6,]    5    7
##  [7,]    6    7
##  [8,]    1    8
##  [9,]    5    8
## [10,]    3    9
## [11,]    4    9
## [12,]    6    9
## [13,]    2   10
## [14,]    7   10
## $`1`
## + 3/10 vertices, named, from 9df1f50:
## [1] 5 7 8
## 
## $`2`
## + 1/10 vertex, named, from 9df1f50:
## [1] 10
## 
## $`3`
## + 2/10 vertices, named, from 9df1f50:
## [1] 4 9

Node level metrics

A key concept in network analysis is centrality. It refers to the importance of a node within the network. Generally speaking, if a node is well connected, it can exercise more influence over the network.

Let’s imagine our network is composed of a group of 10 people. Links represent friendship and trust between them. If we want to know how fake news are spread across the network, centrality measures can help us to understand what would happen if any of the nodes start a rumor8 This same kind of analysis can be made for the spread of diseases, political beliefs, among many different research questions..

Who do you think that can be more effective spreading the misinformation? Node 4, 7 or 2?

There are many ways to define the influence of a node, and which one is the best depends on the research question.

## # A tibble: 10 × 2
##     node degree
##    <int>  <dbl>
##  1     1      3
##  2     2      1
##  3     3      2
##  4     4      4
##  5     5      4
##  6     6      3
##  7     7      4
##  8     8      2
##  9     9      3
## 10    10      2

We can see that by degree centrality nodes 4, 5, and 7 are considered equally important.

\[ closeness(i) = \frac{N-1}{\sum_{j\neq i}d_{i,j}} \]

## # A tibble: 10 × 2
##     node closeness
##    <int>     <dbl>
##  1     5     0.067
##  2     7     0.067
##  3     4     0.059
##  4     6     0.059
##  5     1     0.056
##  6     9     0.05 
##  7    10     0.048
##  8     8     0.045
##  9     3     0.042
## 10     2     0.034

When we consider closeness, node 4 is not as important as in degree centrality. As it is not connected to node 7, its distance to nodes 10 and 2 is rather long.

\[ betweenness(i)= \sum_{j,k\neq i}\frac{b_{jik}}{b_{jk}} \]Where \(b_{jk}\) are all the shortest paths between nodes \(j\) and \(k\), and \(b_{jik}\) are all of those shortest paths that go through \(i\).

## # A tibble: 10 × 2
##     node betweenness
##    <int>       <dbl>
##  1     7       16.7 
##  2     5       10.2 
##  3     4        8.83
##  4    10        8   
##  5     6        7   
##  6     1        1.83
##  7     9        1.5 
##  8     2        0   
##  9     3        0   
## 10     8        0

For betweenness, the node 7 is the most important by far, as it is the only bridge to nodes 10 and 2, and one of the two connections between nodes {5, 1, 8} and nodes {6, 9, 4, 3}.

This is a recursive problem, as the centrality can be defined as

\[ x_i = \kappa^{-1} \sum_{\text{nodes j connected to i}}x_j \]

The centrality of the node \(x_i\) is some proportion of the centrality of its neighbors (\(\kappa\) being a proportionality constant).

We can use the adjacency matrix \(A\) to redefine this problem, given that if the nodes are connected, \(A_{ij}\) is 1, and 0 otherwise.

\[ x_i = \kappa^{-1} \sum_{j}^nA_{ij}x_j \] It is called eigen centrality, because the solution to this equation is the eigen vector of the adjacency matrix10 In algebra this is defined as \(Ax=\lambda x\), where \(\lambda\) is the eigen value and \(x\) the eigen vector that solve this equation. For the scope of this class, we can just think that this gives us the solution of our recursive problem..

## # A tibble: 10 × 2
##     node eigen_centrality
##    <int>            <dbl>
##  1     5             1   
##  2     4             0.96
##  3     7             0.91
##  4     6             0.81
##  5     1             0.77
##  6     9             0.72
##  7     8             0.56
##  8     3             0.52
##  9    10             0.32
## 10     2             0.1

When we consider eigen centrality, node 5 is more important than node 7. This happens because node 5 is connected to both nodes 4 and 7, which are the two following nodes in importance, while nodes 4 and 7 are not connected between them.

Network level metrics

Social networks

Random networks

Networks are complex structures. There are no simple tests like in descriptive statistics that we can use to check if a network follows a property, but random networks can be used to compare some properties. Random networks are generative models, where we can specify some parameters like the number of nodes, edges, or the probability of a link between nodes, and a network is built following a specific algorithm.

Preferential attachment model

These models can also help us infer which are the evolutive properties of a network. For example, the preferential attachment model iteratively adds new nodes to the network. This new nodes have a probability to connect with old nodes that is given by the degree of the old nodes. This iterative process recreates the rich gets richer effect.

One type of preferential attachment model is the Barabási-Albert (BA) model. The algorithm of the BA model works as follows:

  1. The network begins with an initial connected network with \(n\) nodes.
  2. New nodes are added to the network, one at a time. Each new node will connect with the previous nodes in the network with the probability:

\[ p_{i}\sim k_i^\alpha \]Where \(k_i\) is the degree of the node \(i\) and \(\alpha\) is a parameter that defines the power of the preferential attachment.

Let’s see how this process looks like:

What effects does a cumulative process have?

Let’s see the degree distribution of such type of networks:

The centrality follows a power-law distribution, where a few nodes have a really high centrality, while the great majority has a low centrality.

Small world

Another property that is commonly found in real-world networks is the small world phenomena: as networks get bigger in number of nodes, the average distance of the network only increases mildly. Huge networks with millions of nodes have very short distances between any pair of nodes, even if it is sparsely connected12 “Six degrees of separation”: is the idea that all people are six or fewer social connections away from each other. Although this number is somehow forced, in Facebook the average distance between any pair of users is 5.73, and in twitter 4.67..

This happens because of a combination of clustering and power-law degree distributions: there are a few highly connected nodes that guarantee the more distant connections, and then the densely connected neighborhoods easily connect the node with that highly connected node in its neighborhoods.

Communities

Detecting communities within a network is a common task in SNA.

Finding Twitter communities, for example, can be the key for understanding the patterns of political polarization (Conover et al. 2011). A community in a network is a sub-graph in which the nodes are densely connected within them, and sparsely connected to nodes outside their community.

It is the same concept as clustering, but the techniques used to detect them are different, given the particular structure of network’s data.

Modularity (\(Q\)) (Newman 2006) is an important metric for community detection. It shows the extent to which the number of links between a group of nodes is greater (\(Q>0\)) or smaller (\(Q<0\)) than expected at random in that graph.

Let’s see how our networks looks like if we build communities using Louvain algorithm:

These are just two examples, but there are many other algorithms, and which one is the best fit depends on the research question.

Homophily

One of the main findings across several studies is the concept of homophily: the idea that people tend to relate more to others that they perceive similar to them in some way.

A great example is the political homophily: people tend to relate more with people that share their political beliefs. For example, the figure below shows the Twitter network of Argentina.

Twitter Argentina (González 2016)

Twitter US also shows a similar split between democrats and republicans.

Twitter US (Ribeiro et al., n.d.)

To measure homophily, we can also use the metric of modularity, but instead of measuring the groups built as a community detection algorithm, we use it with taking the nodes grouped by an attribute attached to each one of them (e.g. democrat or republican).

Discussions

Why does the rich get richer?

The preferential attachments mechanism for building networks is an interesting model to rethink inequalities. The time component allows to think how if a small difference happens repeatedly over time, and has a cumulative nature, it can create an outcome with extreme inequalities. This model also shows that those who enter first to the system of accumulation already have a unbeatable advantage with respect to those that enter afterwards.

Income has such cumulative mechanism. Money makes money. If a group of the population was systematically deprived for generations, even if there seems to be equal opportunities at some point in time, the cumulative nature of income will continue to affect that group. This is why colorblind policies are not enough, and without restorative actions reaching equality is impossible.

Re-discussing homophily

Homophily is one of the biggest sociological ideas that came up from network studies. A classical example has always been the friendship network in a US high school, which shows how black and white kids tend to play with other kids that share their identity (Bearman, Moody, and Stovel 2002). Nevertheless, this concept is also problematic, and without enough care it can foster misleading conclusions.

When we talk about identities, there are privileged and excluded populations. The concept of homophily equas all groups to a single dynamic, when in reality there are different phenomena happening for different groups.

The empirical observation of a high modularity needs to be properly contextualized. An extreme example would be the residential segregation in US. In the 30’s, the Federal Housing Administration made public housing projects exclusively for white people. If we would measure the racial segregation by neighborhoods at that time, we could think that both black and white people showed homophily behaviors. But black people did not had a choice, as they were excluded from white neighborhoods. The reasons behind the behavior of each group are strikingly different, and modularity alone cannot capture that complexity.

References

Bearman, Peter Shawn, James Moody, and Katherine Stovel. 2002. “Chains of Affection: The Structure of Adolescent Romantic and Sexual Networks.” https://doi.org/10.7916/D83R10RS.
Blondel, Vincent D., Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. 2008. “Fast Unfolding of Communities in Large Networks.” Journal of Statistical Mechanics: Theory and Experiment 2008 (10): P10008. https://doi.org/10.1088/1742-5468/2008/10/P10008.
Conover, Michael, Jacob Ratkiewicz, Matthew Francisco, Bruno Goncalves, Filippo Menczer, and Alessandro Flammini. 2011. “Political Polarization on Twitter.” Proceedings of the International AAAI Conference on Web and Social Media 5 (1): 89–96. https://doi.org/10.1609/icwsm.v5i1.14126.
González, Pablo A. 2016. “Jugada preparada | El Gato y La Caja.” https://elgatoylacaja.com/jugada-preparada.
Newman, M. E. J. 2006. “Modularity and Community Structure in Networks.” Proceedings of the National Academy of Sciences 103 (23): 8577–82. https://doi.org/10.1073/pnas.0601602103.
Pons, Pascal, and Matthieu Latapy. 2005. “Computing Communities in Large Networks Using Random Walks.” In, edited by pInar Yolum, Tunga Güngör, Fikret Gürgen, and Can Özturan, 284–93. Lecture Notes in Computer Science. Berlin, Heidelberg: Springer. https://doi.org/10.1007/11569596_31.
Ribeiro, Manoel Horta, Pedro H. Calais, Virgílio A. F. Almeida, and Wagner Meira Jr. n.d. "Everything i Disagree with Is #FakeNews": Correlating Political Polarization and Spread of Misinformation.” https://doi.org/10.48550/arXiv.1706.05924.